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Comparison  Between  Self-Guided  Langevin  Dynamics 
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Refinement  of  Protein  Loop  Conformations 
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This  article  presents  a  comparative  analysis  of  two  replica- 
exchange  simulation  methods  for  the  structure  refinement  of 
protein  loop  conformations,  starting  from  low-resolution 
predictions.  The  methods  are  self-guided  Langevin  dynamics 
(SGLD)  and  molecular  dynamics  (MD)  with  a  Nose-Hoover 
thermostat.  We  investigated  a  small  dataset  of  8-  and  12- 
residue  loops,  with  the  shorter  loops  placed  initially  from  a 
coarse-grained  lattice  model  and  the  longer  loops  from  an 
enumeration  assembly  method  (the  Loopy  program).  The 
CHARMM22  +  CMAP  force  field  with  a  generalized  Born 
implicit  solvent  model  (molecular-surface  parameterized 
GBSW2)  was  used  to  explore  conformational  space.  We  also 
assessed  two  empirical  scoring  methods  to  detect  nativelike 
conformations  from  decoys:  the  all-atom  distance-scaled  ideal- 
gas  reference  state  (DFIRE-AA)  statistical  potential  and  the 
Rosetta  energy  function.  Among  the  eight-residue  loop 
targets,  SGLD  out  performed  MD  in  all  cases,  with  a  median  of 
0.48  A  reduction  in  global  root-mean-square  deviation  (RMSD) 


of  the  loop  backbone  coordinates  from  the  native  structure. 
Among  the  more  challenging  12-residue  loop  targets,  SGLD 
improved  the  prediction  accuracy  over  MD  by  a  median  of 
1.31  A,  representing  a  substantial  improvement.  The  overall 
median  RMSD  for  SGLD  simulations  of  12-residue  loops  was 
0.91  A,  yielding  refinement  of  a  median  2.70  A  from  initial 
loop  placement.  Results  from  DFIRE-AA  and  the  Rosetta  model 
applied  to  rescoring  conformations  failed  to  improve  the 
overall  detection  calculated  from  the  CHARMM  force  field.  We 
illustrate  the  advantage  of  SGLD  over  the  MD  simulation 
model  by  presenting  potential-energy  landscapes  for  several 
loop  predictions.  Our  results  demonstrate  that  SGLD 
significantly  outperforms  traditional  MD  in  the  generation  and 
populating  of  nativelike  loop  conformations  and  that  the 
CHARMM  force  field  performs  comparably  to  other  empirical 
force  fields  in  identifying  these  conformations  from  the 
resulting  ensembles.  Published  2011  Wiley  Periodicals,  lnc.+ 
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Introduction 

Refinement  of  comparative  protein  structures  is  of  significant  in¬ 
terest  given  the  rapid  decoding  of  new  sequences  from  large- 
scale  genomic  efforts  and  the  desire  to  model  three-dimen¬ 
sional  protein  structures  to  accurate  resolution.  Notable  exam¬ 
ples  of  computational  refinement  of  protein  models  include  the 
work  of  Levitt  and  coworkers/1-33  the  work  of  the  Baker  lab[4_6] 
and  Zhu  et  al.,[7]  the  work  from  the  Skolnick  group, [8,9]  Jacobson 
and  coworkers/1 0]  and  Chen  and  Brooks/1 1]  among  others  that 
participated  in  recent  Critical  Assessment  of  Protein  Structure 
Prediction  meetings/123  An  integral  component  of  structure 
refinement  is  the  modeling  of  protein  loops.  One  of  the  more 
exigent  issues  is  how  to  improve  the  conformational  sampling 
of  all-atom  simulation  methods  to  Tunnel"  structures  to  the 
native  loop  basin  on  a  vast  energy  landscape  starting  from  low- 
resolution  model  predictions.  A  second  issue  is  the  develop¬ 
ment  of  scoring  functions  to  detect  the  native  conformation 
among  a  large  set  of  loop  decoys. 

Molecular  dynamics  (MD)  simulations  for  conformational  sam¬ 
pling  have  often  been  employed  for  final-stage  refinement  of  a 
predicted  structure  with  promising  results/2,1 1_13]  The  most  sig¬ 
nificant  challenge  to  MD-based  structure  refinement  is  that  the 
energy  landscape  of  protein  conformational  space  contains 
many  potential-energy  barriers  that  present  kinetic  traps  to  find¬ 
ing  the  native  basin.  One  method  designed  to  address  this  is 


temperature-based  replica  exchange  (T-ReX),[14]  which  uses 
multiple  parallel  simulations  at  a  range  of  temperatures  to  over¬ 
come  local  minima  and  promote  large  conformational  excur¬ 
sions  along  the  potential-energy  surface.  Although  T-Rex  is  fre¬ 
quently  used  with  MD  simulations  and  has  enjoyed  some 
success  in  structure  refinement,  it  is  often  insufficient,  particu¬ 
larly  when  refining  longer  loops  or  more  complex  systems.  An 
alternative  to  MD,  yet  relatively  untested  method  for  improving 
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sampling  is  based  on  self-guided  Langevin  dynamics  (SGLD) 
simulations.1153  The  SGLD  method  differs  from  the  standard  Lan¬ 
gevin  equation  by  an  introduction  of  ad  hoc  guiding  force.  This 
force  term  is  calculated  as  a  local  average  of  the  friction  forces 
during  a  SGLD  simulation  and  is  thought  to  accelerate  low-fre¬ 
quency  modes  that  hinder  transitions  across  high  potential- 
energy  barriers.  Wu  and  Brooks  demonstrated  through  several 
model  systems  that  the  SGLD  simulation  method  can  provide 
enhanced  conformational  sampling  of  an  energy  surface  with¬ 
out  significant  alteration  in  conformational  distribution. C15] 

In  this  work,  we  seek  to  combine  SGLD  with  T-ReX  for  the 
structure  refinement  of  loops.  Our  goal  is  not  to  provide  an  ex¬ 
haustive  benchmarking  of  the  SGLD/T-ReX  method,  but  rather 
to  contrast  its  performance  with  that  of  the  more  conventional 
MD/T-ReX  simulations  using  a  small  dataset  of  loop  targets. 
This  dataset  contains  eight-residue  loops,  which  are  relatively 
tractable  refinement  problem  for  a  number  of  loop  modeling 
algorithms,  and  longer  12-residue  loops,  which  are  almost  uni¬ 
versally  challenging  for  structure  prediction  and  refinement/163 

In  addition  to  exploring  the  SGLD  method,  we  also  revisit  the 
problem  of  detection  of  nativelike  structures  among  decoys[17] 
by  evaluating  three  scoring  methods.  The  first  is  based  on  the 
force  field  and  generalized  Born  implicit  solvent  model  used  to 
generate  the  loop  conformations.  The  second  approach  is 
rescoring  the  conformations  by  the  all-atom  distance-scaled 
ideal-gas  reference  state  (DFIRE-AA)  statistical  potential  func- 
tion.[18]  The  third  approach  is  the  Rosetta  all-atom  energy  func- 
tion.[19]  In  our  study,  side  chains  in  the  loop  stem  are  modeled 
to  be  flexible  during  the  simulations  and  thus  replicate  the  inex¬ 
act  local  environments  found  in  real-world  refinement  of  com¬ 
parative  protein  models.  Using  the  Rosetta  method,  we  examine 
the  possible  benefit  of  repacking  all  side  chains  and  their  energy 
optimization.  Motivation  for  applying  the  latter  scoring  method 
is  the  observation  from  previous  studies  that  all-atom  stimula¬ 
tions  typically  generate  nativelike  backbone  conformations,  yet 
placement  of  the  side  chains  from  thermal  sampling  often  leads 
to  a  poorly  defined  energy  funnel  to  the  native  basin.[20] 

Computational  Methods 

Simulation  models 

As  detailed  by  Wu  and  Brooks/153  the  scheme  of  self-guided 
simulations  is  to  enhance  conformational  sampling  by  incorpo¬ 
rating  information  extracted  from  the  trajectory  during  the 
simulation.  The  information  is  typically  a  local  property  ave¬ 
raged  over  the  adjoining  protein  conformational  space  near 
the  current  conformation  of  the  simulation  trajectory.  An  ear¬ 
lier  development  of  this  idea  is  self-guided  MD  simulations/213 
where  time-averaged  forces  are  applied  as  a  guiding  term.  For 
the  SGLD,  time-averaged  momentum  is  applied  and  has  the 
effect  of  accelerating  low-frequency  modes.  The  equation  of 
motion  for  an  SGLD  simulation  is 

p,  =  f#*-7/p/  +  R/  +  /lg/,  (1) 

where  p,  is  the  rate  of  change  of  the  momentum  of  particle  /, 
f,  is  the  force  acting  on  the  particle,  y-,  is  the  friction  constant, 


R,  denotes  the  random  force,  and  g,  is  a  memory  function, 
which  is  scaled  by  guiding  factor  2.  The  memory  function  g,  is 
defined  by  the  moving  average  of  the  momentum  seen  by 
the  system  over  an  interval  of  time,  L : 

9/  =  <P /)(.,  (2) 

where  (•••)/_  denotes  a  local  average.  The  time  interval  is  fur¬ 
ther  defined  as  L  =  tL/St,  where  tL  is  the  local  averaging  time 
and  St  the  time  step  along  the  simulation  trajectory.  Equation 
(2)  indicates  that  the  guiding  force  is  a  local  average  of  the 
friction  force  and  should  increase  the  chances  of  sampling  the 
native  or  lowest  energy  topology  of  a  protein  by  preferentially 
increasing  the  speed  of  the  slowest  conformational  motions. 
However,  as  noted  by  Wu  and  Brooks,  the  SGLD  equation  of 
motion  is  fundamentally  approximate.  This  approximate  nature 
was  investigated  in  a  recent  study  of  the  SGLD  method  used 
to  model  protein  folding-unfolding  transitions  and  it  was 
observed  that  the  main  drawback  of  incorporating  the  ad  hoc 
force  term  is  possible  distortion  of  the  free-energy  surface 
applied  to  the  calculation  of  thermodynamic  observables/223 
Here,  we  investigate  the  applicability  of  the  SGLD  method  to 
model  conformational  changes  that  are  inherently  much 
smaller  than  protein  folding  and  whether  the  method  can  do 
so  without  incurring  significant  distortions  in  the  distribution 
of  potential  energies. 

Our  SGLD  model  simulations  were  carried  out  using  the 
CHARMM22  force  field  with  the  CMAP  backbone  dihedral 
cross-term  extension/23,243  The  friction  constant  y  was  set  to  1 
ps-1  for  all  heavy  atoms,  the  guiding  factor  2  set  to  a  value  of 
1,  and  the  averaging  time  tL  was  set  to  1  ps.  Selection  of  these 
values  was  taken  from  our  previous  study  of  the  SGLD 
model/223  For  comparison  purposes,  MD  simulations  were 
applied  using  a  Nose-Hoover  thermostat  with  a  temperature 
coupling  constant  of  50  kcal  s-2.  An  integration  time  step  of  2 
fs  was  used  for  all  simulations.  Nonbonded  interaction  cutoff 
parameters  for  electrostatics  and  vdW  terms  were  set  at  a  ra¬ 
dius  of  22  A  with  a  2-A  potential  switching  function.  Covalent 
bonds  between  the  heavy  atoms  and  hydrogens  were  con¬ 
strained  by  the  SHAKE  algorithm/253  For  modeling  the  protein 
stem  outside  of  the  loop  segment  and  to  prevent  unfolding  at 
higher  temperatures,  Ca  and  Cp  atom  coordinates  were  teth¬ 
ered  to  their  initial  crystallographic  positions  with  a  force  con¬ 
stant  of  1.0  kcal  (mol-1  A-2). 

To  model  electrostatic  solvent  effects,  we  used  the  molecu¬ 
lar-surface-based  generalized  Born  switching-window  (GBSW2) 
solvent  model/263  This  implicit  solvent  model  was  parameter¬ 
ized  to  fit  the  Lee-Richards  molecular-surface  Poisson  results 
and  requires  model  parameters  to  be  set  to  values  of  w  =  0.2 
A,  a0  =  1.2045,  and  a t  =  0.1866.  The  hydrophobic  cavitation 
energy  term  was  approximated  by  a  linear  product  of  the  sol- 
vent-exposed  surface  area  of  the  solute  and  a  phenomenologi¬ 
cal  surface  tension  coefficient  set  to  30  cal  (mol-1  A-2). 

The  application  of  the  GBSW2  model  is  in  contrast  to  the  ear¬ 
lier  calculations  where  the  GBMV2  implicit  solvent  model  was 
used/203  This  revision  in  our  simulation  approach  has  several 
potential  advantages.  It  has  been  shown  by  Chocholousova 
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and  Feig[26]  that  GBSW2  exhibits  good  agreement  with  Poisson 
calculated  solvent  energies  and,  because  of  the  existence  of 
fewer  higher  frequency  components  in  the  GBSW2  model  as 
compared  with  GBMV2,  the  protein-solvent  dielectric  transition 
is  less  abrupt.  The  advantage  of  improved  smoothness  of  the 
dielectric  boundary  should  allow  greater  excursions  on  rugged 
conformational  energy  landscapes  by  increasing  transmission 
probabilities  across  high  potential-energy  barriers.  In  addition, 
more  stringent  energy  conservation  should  be  obtained  by 
GBSW2  using  the  Nose-Hoover  and  Langevin  thermostats,  and 
subsequently  yield  improvement  in  sampling  convergence.1263 
As  a  practical  note,  GBSW2  is  more  computationally  efficient 
than  GBMV2  due  to  the  calculational  scheme  of  determining 
the  Born  radii.  Despite  these  advantages,  one  possible  short¬ 
coming  of  GBSW2  compared  with  GBMV2  is  the  introduction  of 
artifacts  in  conformational  landscapes  computed  for  thermody¬ 
namic  protein  folding.[27]  While  this  disadvantage  may  have  an 
effect  in  some  cases  of  structure  refinement,  where  large 
perturbations  are  required  of  the  nature  of  unfolding-folding 
transitions,  it  is  most  probably  negligible  for  modeling  medium 
size  loops. 

Replica-exchange  simulations  were  performed  using  the 
MMTSBC28]  utilities  and  programming  libraries  for  implementing 
the  CHARMM  simulation  program  (version  c33b2).[29]  Simula¬ 
tions  were  carried  out  over  a  total  of  4-ns  simulation  time  for 
each  replica,  generating  a  final  culled  population  of  64,000  loop 
conformations  for  each  loop  target  using  16  replicas  with  the 
temperature  range  of  298-400  K.  Frequency  of  replica 
exchanges  was  set  to  every  1  ps  of  simulation.  The  starting 
input  structures  for  the  protein  targets  into  the  T-ReX  simula¬ 
tions  were  obtained  from  two  different  methods.  For  modeling 
eight-residue  loops,  the  starting  structures  were  generated 
from  a  low-resolution  cubic  lattice  model  and  details  of  the  sim¬ 
ulation  protocol  are  given  in  earlier  work.[20]  Initial  placement 
of  12-residue  loop  coordinates  is  based  on  predictions  using 
the  enumeration  scheme  of  the  Loopy  program  developed  by 
the  Honig  lab.[30]  The  top-scoring  structure  from  Loopy  is  used 
as  input  to  SGLD  and  MD  simulations  to  search  conformational 
space  of  locating  more  optimal  loop  conformations.  Our  test 
set  consists  of  five  8-residue  loops  and  six  12-residue  loops 
taken  from  a  diverse  set  of  protein  structures.[16,30] 

Scoring  of  protein  structures 

Three  different  scoring  functions  were  applied  to  select  the 
"best'  loop  conformation  from  the  ensemble  of  conformers 
generated  by  the  simulation  models.  The  first  is  identical  to 
the  force  field  (CHARMM22  +  CMAP  with  the  GBSW2  model) 
used  to  generate  the  loop  decoys.  Here,  we  define  the  scoring 
function  as 

G  =  L/jnt  +  Gsoiv  —  kB7"ln  M,  (3) 

where  each  loop  conformation  is  evaluated  as  the  sum  of  the 
internal  potential  energy,  L/int,  and  the  GBSW2  solvent  energy, 
Gsoiv/  plus  a  term  that  accounts  for  the  multiplicity  of  confor¬ 
mations/203  M,  for  a  cluster  of  loop  structures  at  absolute  tem¬ 
perature  T,  and  where  /cB  is  the  Boltzmann  constant.  Culled 


conformations  from  the  T-ReX  simulations  were  clustered  on 
the  basis  of  pairwise  backbone  root-mean-square  deviation 
(RMSD)  distances  (described  below).  A  hierarchical  clustering 
scheme  was  applied  that  includes  an  agglomerative  approach 
with  automatic  stopping  criteria.  Specific  details  of  our  cluster¬ 
ing  approximation  are  given  in  previous  work/203 

In  addition  to  culling  structures  at  the  specific  temperature 
of  298  K  from  T-ReX  for  direct  scoring  using  eq.  (3),  we  used 
the  weighted  histogram  analysis  method  (WHAM)[31]  to  calcu¬ 
late  the  probability  density  of  conformational  states  as  a  func¬ 
tion  of  the  total  conformation  energy  (L/int  +  GSOiv)  and  RMSD 
from  the  X-ray  crystallographic  structure.  In  our  WHAM  calcula¬ 
tions,  conformations  from  all  16  replicas  were  applied  and  we 
report  free  energies  calculated  for  T  =  298  K. 

The  second  energy  scoring  function  is  the  DFIRE-AA  statisti¬ 
cal  potential183  and  is  defined  as 

NobsjiJ,  r)  {4) 

Nobs(i,j,  hut ) 

where  /  and  j  are  non-hydrogen  atom  types,  r  is  a  pairwise  dis¬ 
tance,  rcut  is  the  cutoff  beyond  which  pairwise  interactions  are 
neglected,  Ar  is  the  histogram  bin  size,  Nobs  is  a  cumulative  his¬ 
togram  of  the  observed  occurrence  of  pairs  as  a  function  of  the 
pairwise  distance,  and  a  is  set  to  1.61  based  on  an  empirical 
analysis  of  hard-sphere  protein-like  spatial  distributions.  The  his¬ 
tograms  A/obs  in  this  work  were  obtained  from  previous  analysis 
of  a  culled  set  of  1836  Protein  Data  Bank  (PDB)  structures  which 
had  better  than  1.8-A  resolution  and  were  less  than  30%  homol¬ 
ogous  to  each  other/173  We  deviated  from  the  original  DFIRE 
protocol  by  assigning  Ar  =  0.5  A  at  all  distances  and  having  r 
range  from  0.25  to  14.75  A,  such  that  rcut  =  15  A. 

The  third  scoring  approach  is  application  of  the  Rosetta 
energy  function.  The  challenge  in  using  the  Rosetta  energy 
function  to  score  loop  decoys  selected  from  a  298-K  ensemble 
generated  from  the  CHARMM22  +  CMAP/GBSW2  force  field  is 
that  they  are  suboptimal  structures  under  the  Rosetta  energy 
function.  Our  general  approach  is  twofold:  first,  to  improve 
suboptimal  structures  in  CHARMM-generated  in  Rosetta  by 
allowing  limited  sampling  to  identify  a  local  minimum  in  the 
Rosetta  energy  landscape  near  a  given  decoy  structure;  sec¬ 
ond,  to  modify  the  Rosetta  energy  function  to  accommodate 
suboptimal  structural  features  that  cannot  be  improved  by  the 
limited  sampling.  Toward  these  ends,  we  developed  a  simple 
protocol  using  Rosetta  v3.1[19]  that  optimizes  hydrogen  place¬ 
ment,  packs  side  chains,  and  minimizes  and  calculates  the 
energy  under  a  modified  Rosetta  energy  function. 

For  each  decoy,  the  following  protocol  was  followed.  First, 
we  used  Rosetta  to  replace  hydrogen  atoms  using  standard 
bond  geometry  derived  from  the  CHARMM19  force  field/323 
We  then  used  the  Rosetta  fixed-backbone  packing  applica¬ 
tion^33  to  optimize  the  decoy  side-chain  conformations  under 
the  Rosetta  energy  function  using  the  backbone-dependant 
rotamer  library  developed  by  Dunbrack  and  Cohen[34]  that  was 
expanded  to  include  rotamers  at  ±1  standard  deviation  from 
the  standard  %  values  at  xi  and  /2a[35]  as  well  as  extra  rotamers 
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Table  1.  Structure  refinement  of  loop  conformations  using  replica-exchange  MD  and  SGLD  simulations 

Model 

Starting  loop 

RMSD 

%  Loops  sampled 
RMSD  <  2  A 

Lowest  sampled 
RMSD 

Force-field 

detection  RMSD 

DFIRE-AA 
scoring  RMSD 

Rosetta  scoring 
RMSD 

8-Residue  loops 

1  lit:82— 89 

2.21 

MD 

24 

0.87 

1.92 

2.23 

5.03 

SGLD 

67 

1.19 

1.44 

2.18 

1.48 

1  plc:6— 1 3 

4.58 

MD 

18 

0.56 

7.30 

7.92 

7.07 

SGLD 

68 

0.44 

0.66 

0.79 

0.52 

1awd:56-63 

3.61 

MD 

99 

0.27 

0.79 

0.52 

0.61 

SGLD 

100 

0.26 

0.68 

0.63 

0.45 

1  hfc:1 19-126 

3.09 

MD 

99 

0.33 

1.01 

0.61 

0.73 

SGLD 

95 

0.35 

0.80 

0.96 

0.87 

1  rro:18-25 

2.31 

MD 

84 

0.44 

1.39 

0.81 

0.78 

SGLD 

82 

0.46 

0.63 

1.12 

0.85 

12-Residue  loops 

1  bkf:9-20 

2.70 

MD 

21 

1.37 

2.43 

2.46 

1.98 

SGLD 

34 

0.37 

0.88 

2.45 

2.16 

1  ayh:21  —32 

4.30 

MD 

0 

1.91 

4.55 

4.35 

2.74 

SGLD 

6 

1.22 

2.64 

1.27 

2.63 

1  cex:23-34 

1.50 

MD 

100 

0.38 

0.82 

0.68 

0.59 

SGLD 

100 

0.40 

0.62 

0.61 

0.63 

1  akz:1 81—1 92 

2.20 

MD 

<1 

1.89 

3.12 

2.93 

1.99 

SGLD 

0 

2.24 

3.83 

2.97 

2.68 

1531:98-109 

2.80 

MD 

84 

1.08 

1.66 

1.83 

1.81 

SGLD 

89 

0.36 

0.86 

1.96 

1.33 

1arb:74-85 

2.70 

MD 

25 

1.17 

2.00 

1.99 

2.14 

SGLD 

88 

0.61 

0.93 

0.67 

0.85 

All  RMSD  values  are  in  units  of  Angstrom. 

at  y3  and  /4.[36]  The  initial  decoy  side-chain  conformations 
were  also  added  to  the  library.  Following  packing,  the  struc¬ 
ture  was  energy  minimized  along  backbone  and  side-chain 
torsion  angles  using  a  gradient-based  single-line  minimization 
scheme  and  scored  under  the  Rosetta  energy  function,  using 
the  scoring  application  in  Rosetta  v3.1. 

The  Rosetta  energy  function,  ERosettai  consists  of  a  linear  com¬ 
bination  of  energy  terms  that  represent  van  der  Waals  interac¬ 
tions,  residue-residue  interactions,  solvation,  hydrogen-bonding, 
and  side-chain  and  backbone  conformational  energies: 

^Rosetta  =  0.8£uattr  +  0.44£|jrep  +  0.49£pajr  +  0.65£SO|V 
"T  0.58£|-|B-short(bb-bb)  “F  "I  •  ”1  ^^HB-long(bb-bb)  “F  1  1  ^^HB(bb-sc) 

“F  ^hb(sc-sc)  ~F  0.56 Erot(aa  (p  ^  +  0.2£^(aa)  +  £aa(<p?1/,),  (5) 

where  van  der  Waals  interactions  are  represented  by  the 
attractive  and  repulsive  parts  of  a  modified  12-6  Lennard- 
Jones  potential  (£Uatti»  £ureP)-  Residue-residue  interactions  are 
modeled  using  a  residue  pairwise  potential  (£pair)  derived  from 
PDB  statistics.  Solvation  is  modeled  using  the  Lazaridis-Karplus 
implicit  solvation  model  (£SOiv)-[37]  Flydrogen-bonding  is  mod¬ 
eled  using  an  orientation-dependent  potential  parameterized 


from  quantum  mechanics  calculations1383  and  PDB  statistics1393 
for  short-range  (bond  length  <  2.6  A)  and  long-range  (bond 
length  >  2.6  A)  backbone-backbone  hydrogen  bonds  (£Hb- 
short(bb-bb),  ^HB-iong(bb-bb))/  backbone-side-chain  hydrogen 
bonds  (£HB(bb-sc))/  and  side-chain-side-chain  hydrogen 
bonds  (£hb(sc-sc))-[40]  Side-chain  and  backbone  conformational 
energies  are  represented  by  statistical  potentials  derived 
from  amino-acid  and  backbone-dependent  rotamer  proba¬ 
bilities  (£rot(aa,^,^))/  amino-acid-dependent  cpl\\j  angle  proba¬ 
bilities  (£^(33))/  and  <pfy  angle-dependent  amino-acid 
probabilities  (£aa(^)),  derived  from  PDB  statistics.135,36,403 

Weights  of  the  individual  terms  in  eq.  (5)  were  obtained 
from  an  updated  and  modified  version  of  score12,  the  weight- 
set  originally  used  for  all-atom  structural  refinement^53 
Compared  with  the  standard  score12  weight-set,[19]  we 
removed  the  "pro_close"  and  "omega"  energy  terms,  which  are 
statistical  potentials  reflecting  proline-ring  strain  energy  and 
backbone  co  torsional  energy.  Near-native  structures  generated 
under  the  CHARMM  force  field  showed  high  energies  along 
these  statistical  energy  terms  compared  with  crystal  structures, 
which  significantly  impeded  decoy  discrimination. 
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Figure  1.  Conformational  energy  landscape  for  eight-residue  loop  target  1  plc:6-1 3  starting  from  an  initial  placement  of  4.58  A  from  the  native,  a)  WHAM 
profile  evaluated  at  temperature  298  K  (red  color  represents  high  population  density  and  blue  denotes  low  density),  b)  WHAM  profile  for  MD  simulation, 
c)  DFIRE-AA  rescoring  of  SGLD  generated  conformations,  d)  DFIRE-AA  rescoring  of  MD  generated  conformations,  e)  Rosetta  rescoring  of  SGLD  generated 
conformations,  f)  Rosetta  rescoring  of  MD  generated  conformations. 


Evaluation  metrics 

Both  clustering  and  structure  prediction  evaluation  used  the  global 
RMSD  of  loop  backbone  atoms  between  a  decoy  and  a  reference 
structure.  This  was  calculated  by  superpositioning  the  backbone 
atoms  of  the  loop  stem  residues,  defined  as  residues  flanking  the 
loop,  of  the  decoy  with  that  of  the  reference  structure,  and  then  cal¬ 
culating  the  RMSD  between  the  backbone  atoms  of  the  loop  resi¬ 
dues  with  those  of  the  reference  structure.  For  hierarchical  clustering, 
the  reference  structure  was  another  decoy;  for  evaluating  structure 
predictions,  the  reference  structural  was  the  crystal  structure. 

Results  and  Discussion 

Table  1  summarizes  the  simulation  results  for  structure  refinement 
of  8-  and  12-residue  loops  using  the  two  simulation  models.  All 


computed  RMSD  values  are  for  global  displacements  of  the  loop 
backbone  coordinates  between  a  predicted  structure  and  the  X- 
ray  crystallographic  structure.  Culled  conformations  were 
extracted  at  a  temperature  of  298  K  and  were  clustered  on  the  ba¬ 
sis  of  pairwise  RMSD  distances  using  a  hierarchical  clustering 
scheme.[20,28]  Selection  of  the  eight-residue  loop  targets  for  assess¬ 
ing  the  model  calculations  was  taken  from  previous  work,C20]  which 
demonstrated  the  challenge  of  all-atom  simulations  to  efficiently 
populate  native  basins  using  the  CHARMM22  force  field. 

Our  calculations  for  the  loop  targets  show  the  SGLD  simula¬ 
tion  model  to  produce  more  accurate  structure  refinement 
than  the  MD  model.  For  the  eight-residue  loops,  the  sampled 
lowest  RMSD  conformations  calculated  by  SGLD  and  MD 
simulations  are  roughly  comparable  in  finding  nativelike 
basins;  however,  SGLD  generally  performs  better  in  clustering 
conformers  to  yield  low-RMSD  predictions.  For  the  task  of 
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Figure  2.  Conformational  energy  landscape  for  12-residue  loop  target  1bkf:9-20  starting  from  an  initial  placement  of  2.70  A  from  the  native,  a)  WHAM  pro¬ 
file  evaluated  at  temperature  298  K  for  SGLD  simulation.  Color  spectrum  similar  to  that  listed  in  Figure  1.  b)  WHAM  profile  for  MD  simulation,  c)  DFIRE-AA 
rescoring  of  SGLD.  d)  DFIRE-AA  rescoring  of  MD.  e)  Rosetta  rescoring  of  SGLD.  f)  Rosetta  rescoring  of  MD. 


detecting  nativelike  structures,  SGLD  produces  an  average  RMSD 
of  0.84  A  across  the  five  targets  in  the  eight-residue  loop  dataset 
and  MD  yields  a  statistical  average  RMSD  of  2.48  A,  while  previ¬ 
ously  MD  using  GBMV2  showed  2.56  A.[20]  The  GBSW2  provided 
a  better  model  for  populating  basins  below  2  A  than  GBMV2, 
but  both  GB  models  produced  largely  similar  results  of  detection. 

To  illustrate  the  distinction  between  SGLD  and  MD  for  a 
loop  target  where  the  conventional  sampling  method  strug¬ 
gles,  Figure  1  shows  two-dimensional  probability  density  con¬ 
tour  maps  of  the  potential  energy  versus  RMSD  at  298  K  for 
protein  with  PDB  ID:  1  pic.  We  define  the  potential  energy  as 
the  CHARMM22  +  CMAP  energy  plus  the  GBSW2  solvent 
energy.  The  key  observation  of  the  SGLD  model  is  the  sharp 
and  narrow  cluster  of  conformers  that  funnels  toward  the 
native  basin,  yielding  structure  refinement  of  an  initial  4.6-A 
backbone  RMSD  to  a  final  0.7-A  conformation.  The  corre¬ 


sponding  MD-computed  landscape  is  strongly  bifurcated 
between  near-native  (~2.5-A  RMSD)  and  non-native  loops 
(~6-8  A),  with  the  non-native  basin  showing  greater  popula¬ 
tion  density.  The  comparison  between  the  two  models  sug¬ 
gests  that  the  guiding  force  term  in  SGLD  provided  an  exter¬ 
nal  boost  with  the  net  effect  of  accelerating  transitions  across 
a  topological  barrier  at  roughly  1.7  A,  whereas  traditional  MD 
seems  to  be  locally  trapped  in  less-accurate  neighboring 
basins.  Because  the  potential-energy  function  is  identical  for 
MD  and  SGLD,  the  difference  in  the  results  of  Figure  1  reflects 
differences  in  sampling  convergence  of  the  two  simulation 
models.  Theoretically,  executing  the  MD  simulation  much  lon¬ 
ger  will  eventually  produce  results  similar  to  SGLD,  and  thus 
the  latter  approach  would  appear  to  exhibit  a  distinct  advant¬ 
age.  Below,  we  highlight  individual  targets  that  are  representa¬ 
tive  of  the  overall  results. 
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Figure  3.  Conformational  energy  landscape  for  12-residue  loop  target  1  ayh:21  -32  starting  from  an  initial  placement  of  4.30  A  from  the  native,  a)  WHAM 
profile  evaluated  at  temperature  298  K  for  SGLD  simulation.  Color  spectrum  similar  to  that  listed  in  Figure  1.  b)  WHAM  profile  for  MD  simulation,  c)  DFIRE- 
AA  rescoring  of  SGLD.  d)  DFIRE-AA  rescoring  of  MD.  e)  Rosetta  rescoring  of  SGLD.  f)  Rosetta  rescoring  of  MD. 


For  rescoring  by  the  empirical  potentials  of  the  lowest  tem¬ 
perature  replica  client  conformations  generated  for  eight-resi¬ 
due  loop  targets,  DFIRE-AA  yields  a  statistical  average/median 
RMSD  of  1.41/0.63  A  for  SGLD  and  2.42/0.81  A  for  MD.  Rosetta 
produces  similar  results  of  0.83/0.85  A  for  SGLD  and  2.01/0.78 
A  for  MD.  Figure  1  illustrates  DFIRE-AA  and  Rosetta  evaluations 
of  the  SGLD  and  MD  models  for  target  1  pic. 

Given  the  promising  outcome  of  refining  the  eight-residue 
loops  using  SGLD  simulations,  we  next  evaluate  this  model  for 
the  more  difficult  12-residue  loops.  This  small  dataset  was 
selected  from  an  earlier  reported  study  of  modeling  loops 
using  the  Protein  Local  Optimization  Program  (PL0P).[41]  A 
summary  of  the  results  in  Table  1  for  the  12-residue  loop  tar¬ 
gets  shows  for  SGLD  a  statistical  average/median  sampled  low¬ 
est  RMSD  basin  to  be  0.87/0.51  A  and  detection  to  a  RMSD  of 


1.63/0.91  A.  The  corresponding  MD  model  results  are  1.30/1.27 
A  and  2.43/2.22  A,  respectively.  For  comparison  purposes,  the 
average  and  median  starting  structure  predicted  from  Loopy  is 
2.70  A  and  the  reported  PLOP  predictions  are  2.95/2.99  A, 
where  in  both  cases  the  protein  stem  of  the  loop  region  was 
modeled  as  rigid.C30,41]  It  should  be  noted  that  the  PLOP 
method  has  undergone  recent  improvements  in  conforma¬ 
tional  sampling  and  detection  accuracy  applied  to  different 
loop  target  datasets.[41_43]  Results  for  DFIRE-AA  are  the  mean/ 
median  values  of  1.66/1.62  A  for  SGLD  and  2.37/2.23  A  for  MD, 
and  for  Rosetta,  the  corresponding  values  are  1.71/1.75  A  for 
SGLD  and  1.88/1.98  A  for  MD. 

To  further  demonstrate  the  comparison  between  SGLD  and 
MD  simulations,  we  show  in  Figures  2  and  3  the  probability 
density  distribution  profiles  for  loop  targets  Ibkf  and  layh. 
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The  outcome  from  these  two  loops  provides  a  range  of  results 
probably  to  be  observed  from  modeling  a  much  larger  dataset 
of  targets.  For  Ibkf,  the  profile  computed  by  MD  simulation 
shows  at  2.5-2.8  A  RMSD  a  large  conformational  free-energy 
surface  that  encompasses  the  starting  structure  predicted  by 
Loopy.  Budding  from  this  basin  is  a  less-populated  cluster  near 
2  A.  The  SGLD  simulation  produced  a  similar  large  basin  at 
2.6-A  RMSD,  yet  a  nativelike  basin  emerged  near  a  RMSD  value 
of  1  A.  This  result  illustrates  the  enhanced  sampling  provided 
by  the  SGLD  model,  whereas  the  MD  simulation  is  principally 
confined  to  exploring  local  regions  around  the  starting  loop 
conformation.  Rescoring  the  conformations  by  DFIRE-AA  show 
some  funnel-like  behavior;  however,  the  basin  at  roughly  2.5  A 
is  scored  too  favorably. 

In  a  similar  fashion,  calculations  for  layh  show  the  conven¬ 
tional  MD  method  confined  mostly  to  the  starting  loop  confor¬ 
mation  at  ~4.5-A  RMSD  with  some  excursions  to  lower  and 
higher  RMSD  basins.  Unlike  the  MD  model,  the  SGLD  simula¬ 
tions  traversed  a  potential-energy  barrier  at  a  RMSD  of  ~2  A. 
Although  both  models  failed  to  yield  high-resolution  refine¬ 
ment,  the  SGLD  model  produced  an  energy  landscape  that  is 
highlighted  by  diffusive  sampling  across  multiple  major  basins. 
Rescoring  the  298  K  conformations  by  DFIRE-AA  yields  a  RMSD 
funnel  for  the  SGLD  model  and  provides  detection  to  1.27  A, 
whereas  rescoring  the  MD  conformers  favors  the  starting  ba¬ 
sin.  On  the  other  hand,  Rosetta  fails  to  create  funnel  shapes 
for  both  simulation  models  and  detection  is  greater  than  2  A. 

It  is  worth  noting  the  energy  differences  between  scoring 
the  X-ray  crystallographic  structure  and  the  lowest  energy  con- 
former  for  the  three  scoring  approaches.  For  the  comparison, 
the  X-ray  structure  was  subjected  to  energy  minimization 
using  the  CHARMM22/GBSW2  force  field  and  then  scored  for 
all  loop  targets.  With  one  exception,  alternative  conformations 
generated  by  the  simulations  proved  to  be  more  favorable 
when  scored  by  the  CHARMM22/GBSW2  model  than  their  cor¬ 
responding  energy-minimized  X-ray  structures.  By  contrast, 
both  DFIRE-AA  and  Rosetta  favored  the  X-ray  structures. 
Although  this  result  is  not  entirely  surprising  given  the  signifi¬ 
cant  parameterization  of  the  empirical  models  using  PDB 
structures,  it  does  reflect  a  mismatch  of  resolution  between 
empirical  and  physics-based  scoring  methods.  Conformations 
generated  by  the  simulations  were  culled  from  a  nonadiabatic 
excursion  of  the  energy  surface  and  their  geometries  probably 
deviate  from  ideal  distributions  that  empirical  functions  are 
parameterized  against.  Contributing  to  this  is  possible  artifacts 
due  to  the  implicit  solvent  model  and  having  a  fixed-charge 
model  rather  than  a  flexible-charge  model. [44]  It  is  disappoint¬ 
ing  that  for  low-RMSD  backbone  structures  (<2  A)  a  sharp 
energy  funnel  was  not  created  by  an  approach  of  repacking 
the  side  chains  from  Rosetta.  This  result  may  be  attributed,  in 
part,  to  the  resolution  of  the  rotamer  library  and  the  lack  of 
energy  minimization  to  effectively  redistribute  the  population 
landscape. 

We  conclude  that  while  our  benchmark  of  loop  targets  is  lim¬ 
ited,  the  trend  is  clear  that  SGLD  provides  more  accurate  struc¬ 
ture  refinement  than  traditional  MD  and  achieves  a  median  1.3- 
A  resolution  increase  in  the  backbone  conformation  of  a  starting 


structure  2.7-A  RMSD  from  the  crystallographic  coordinate  place¬ 
ment.  Given  the  approximate  nature  of  the  guiding  force  in  the 
SGLD  equation  of  motion  and  its  possible  failing  of  producing  a 
rigorous  canonical  ensemble, [27]  one  may  have  a  priori  antici¬ 
pated  some  corruption  in  the  conformational  energy  distribution 
in  comparison  with  the  MD  model  using  a  Nose-Hoover  ther¬ 
mostat.  Instead,  we  found  the  potential  energies  from  WHAM  to 
be  comparable  between  the  two  simulation  models.  Neverthe¬ 
less,  Wu  and  Brooks  have  very  recently  reported  a  computa¬ 
tional  strategy  to  convert  a  self-guided  ensemble  to  a  canonical 
ensemble,  using  a  reweighting  technique. [45]  Application  of  their 
formulism  to  complex  protein  systems  await  further  testing.  In 
this  work,  the  subset  of  targets  with  similar  outcome  provides 
well-sampled  conformational  systems  where  the  two  sampling 
methods  are  proven  to  converge  and  thus  help  validate  the 
SGLD  method.  Where  SGLD  excels  is  enhanced  sampling  effi¬ 
ciency  for  loop  targets  where  MD  simulations  become  kinetically 
trapped.  Combining  with  T-Rex,  SGLD  offers  a  better  choice  for 
structure  refinement  of  low-resolution  models.  We  also  observed 
no  significant  advantage  to  rescoring  conformations  from  the 
simulations  by  empirical  all-atom  energy  functions. 
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